(57) ABSTRACT

A system and method for accurately mapping between 
camera coordinates and geo-coordinates, called geo-spatial
registration, using a Euclidean model. The system utilizes 
the imagery and terrain information contained in the geo- 
spatial database to precisely align geographically calibrated 
reference imagery with an input image, e.g., dynamically 
generated video images, and thus achieve a high accuracy 
identification of locations within the scene. When a sensor, 
such as a video camera, images a scene contained in the 
geo-spatial database, the system recalls a reference image 
pertaining to the imaged scene. This reference image is 
aligned very accurately with the sensor's images using a 
parametric transformation produced by a Euclidean model. 
Thereafter, other information that is associated with the 
reference image can easily be overlaid upon or otherwise 
associated with the sensor imagery. 
METHOD AND APPARATUS FOR
PERFORMING GEO-SPATIAL 
REGISTRATION USING A EUCLIDEAN 
REPRESENTATION 

This non-provisional application claims the benefit of 
U.S. provisional application Ser. No. 60/141,460 filed Jun. 
29, 1999, which is hereby incorporated herein by reference.

This application contains subject matter related to U.S. 
patent application Ser. No. 09/075,462, filed May 8, 1998 
and incorporated herein by reference. 

The invention is generally related to image processing 
systems and, more specifically, to a method and apparatus 
for performing geo-spatial registration using a Euclidean
representation within an image processing system. 

GOVERNMENT SUPPORT 

This invention was made with Government support under 
Subcontract No. K57S00006 to Sarnoff Corporation under
Prime Contract No. -4206 awarded by the U.S. Department 
of the Air Force and Contract No. NMA202-97-D-1033 
awarded by the National Imagery and Mapping Agency. The 
Government has certain rights in this invention. 

BACKGROUND OF THE INVENTION 

The ability to locate scenes and/or objects visible in a 
video/image frame with respect to their corresponding loca- 
tions and coordinates in a reference coordinate system is 
important in visually-guided navigation, surveillance and 
monitoring systems. Aerial video is rapidly emerging as a 
low cost, widely used source of imagery for mapping, 
surveillance and monitoring applications. The individual 
images from an aerial video can be aligned with one another 
and merged to form an image mosaic that can form a video 
map or provide the basis for estimating motion of objects 
within a scene. One technique for forming a mosaic from a 
plurality of images is disclosed in U.S. Pat. No. 5,649,032, 
issued Jul. 15, 1997, which is hereby incorporated herein by 
reference. 

To form a "video map", a mosaic (or mosaics) of images 
may be used as a database of reference imagery and asso- 
ciated "geo-coordinates" (e.g., latitude/longitude within a 
reference coordinate system) are assigned to positions 
within the imagery. The geo-coordinates (or other image or 
scene attributes) can be used to recall a mosaic or portion of 
a mosaic from the database and display the recalled imagery 
to a user. Such a searchable image database, e.g., a video 
map, is disclosed in U.S. patent application Ser. No.
08/970,889, filed Nov. 14, 1997, and hereby incorporated herein by
reference. 

A system that images a scene that has been previously 
stored in the reference database and recalls the reference 
information in response to the current images to provide a 
user with information concerning the scene would have 
applicability in many applications. For example, a camera 
on a moving platform imaging a previously imaged scene 
contained in a database may access the database using the 
coordinates of the platform. The system provides scene 
information to a user. However, a key technical problem of 
locating objects and scenes in a reference mosaic with 
respect to their geo-coordinates needs to be solved in order 
to ascertain the geo-location of objects seen from the camera 
platform's current location. In current systems for
geo-location, the mapping of camera coordinates to
geo-coordinates uses position and attitude information for a
moving camera platform within some fixed world coordinates
to locate the video frames in the reference mosaic
database. However, the accuracy achieved is only on the
order of tens to hundreds of pixels. This inaccuracy is not
acceptable for high resolution mapping.

Therefore, there is a need in the art for a method and
apparatus that identifies a location within an imaged scene
with a sub-pixel accuracy directly from the imagery within
the scene itself.

SUMMARY OF THE INVENTION

The disadvantages of the prior art are overcome by the 
present invention of a system and method for accurately 
mapping between camera coordinates and geo-coordinates, 
called geo-spatial registration, using a Euclidean model to 
align and combine images. The present invention utilizes the 
imagery and terrain information contained in the geo-spatial 
database to precisely align the reference imagery with input 
imagery, such as dynamically generated video images or 
video mosaics, and thus achieve a high accuracy identification
of locations within the scene. The geo-spatial reference
database generally contains a substantial amount of reference
imagery as well as scene annotation information,
digital elevation maps (DEM), and object identification
information. When a sensor, such as a video camera, images a
scene contained in the geo-spatial database, the system
recalls a reference image and DEM pertaining to the imaged
scene. This reference image is aligned very accurately with
the sensor's images using a parametric transformation
derived from a Euclidean model. Thereafter, other information
(annotation, sound, and the like) that is associated with
the reference image can easily be overlaid upon or otherwise
associated with the sensor imagery. Applications of
geo-spatial registration include text/graphical/audio annotations
of objects of interest in the current video using the stored
annotations in the reference database to augment and add
meaning to the current video. These applications extend
beyond the use of aerial videos into the challenging domain
of video/image-based map and database indexing of arbitrary
locales, like cities and urban areas.

BRIEF DESCRIPTION OF THE DRAWINGS 

The teachings of the present invention can be readily 
understood by considering the following detailed description
in conjunction with the accompanying drawings, in
which: 

FIG. 1 depicts a conceptual view of a system incorporat- 
ing the present invention; 

FIG. 2 depicts a functional block diagram of the
geo-registration system of the present invention;

FIG. 3 depicts a functional block diagram of the coarse 
alignment block of the system in FIG. 2; 

FIG. 4 depicts a flow diagram of the fine alignment block
of FIG. 2; and

FIG. 5 depicts a flow diagram of a method for aligning 
images using a Euclidean model. 

To facilitate understanding, identical reference numerals 
have been used, where possible, to designate identical 
elements that are common to the figures.

DETAILED DESCRIPTION 

FIG. 1 depicts a conceptual view of a comprehensive
system 100 containing a geo-registration system 106 of the
present invention. The figure shows a mobile platform 102
dynamically capturing "current" video images of a scene at
a specific locale 104 within a large area 108. The system 106
identifies information in a reference database 110 that
pertains to the current video images being transmitted along
path 112 to the system 106. The system 106 "geo-registers"
the current video images to the reference information or
imagery stored within the reference database 110, i.e., the
current video is aligned with geographically calibrated
reference imagery and information using a Euclidean model.
To facilitate the alignment process, the reference information
generally contains a digital elevation map (DEM) that
may have a value of zero, a constant, a complex polynomial,
and so on. The more complex the DEM, the more accurate
the alignment results of the geo-registration process. After
"geo-registration", the footprints of the current video are
shown on a display 114 to a user overlaid upon the reference
imagery or other reference annotations. As such, reference
information such as latitude/longitude/height of points of
interest is retrieved from the database and overlaid on
the relevant points on the current video. Consequently, the
user is provided with a comprehensive understanding of the
scene that is being imaged.

The system 106 is generally implemented by executing
one or more programs on a general purpose computer 126.
The computer 126 contains a central processing unit (CPU)
116, a memory device 118, a variety of support circuits 122
and input/output devices 124. The CPU 116 can be any type
of high speed processor such as a PENTIUM II manufactured
by Intel Corporation or a POWER PC manufactured
by Motorola Inc. The support circuits 122 for the CPU 116
include conventional cache, power supplies, clock circuits,
data registers, I/O interfaces and the like. The I/O devices
124 generally include a conventional keyboard, mouse, and
printer. The memory device 118 can be random access
memory (RAM), read-only memory (ROM), hard disk
storage, floppy disk storage, compact disk storage, or any
combination of these devices. The memory device 118 stores
the program or programs (e.g., geo-registration program
120) that are executed to implement the geo-registration
technique of the present invention. When the general purpose
computer executes such a program, it becomes a
special purpose computer, i.e., the computer becomes an
integral portion of the geo-registration system 106. Although
the invention has been disclosed as being implemented as an
executable software program, those skilled in the art will
understand that the invention may be implemented in
hardware, software or a combination of both. Such implementations
may include a number of processors independently
executing various programs and dedicated hardware
such as application specific integrated circuits (ASICs).

FIG. 2 depicts a functional block diagram of the
geo-registration system 106 of the present invention.
Illustratively, the system 106 is depicted as processing a
video signal as an input image; however, from the following
description those skilled in the art will realize that the input
image (referred to herein as input imagery) can be any form
of image including a sequence of video frames, a sequence
of still images, a still image, a mosaic of images, a portion
of an image mosaic, and the like. In short, any form of
imagery can be used as an input signal to the system of the
present invention.

The system 106 comprises a video mosaic generation
module 200 (optional), a geo-spatial aligning module 202, a
reference database module 204, and a display generation
module 206. Although the video mosaic generation module
200 provides certain processing benefits that shall be
described below, it is an optional module such that the input
imagery may be applied directly to the geo-spatial aligning
module 202. When used, the video mosaic generation module
200 processes the input imagery by aligning the respective
images of the video sequence with one another to form
a video mosaic. The aligned images are merged into a
mosaic. A system for automatically producing a mosaic from
a video sequence is disclosed in U.S. Pat. No. 5,649,032,
issued Jul. 15, 1997, and incorporated herein by reference.

The reference database module 204 provides geographi- 
cally calibrated reference imagery and information 
(including DEM) that is relevant to the input imagery. The 
camera platform (102 in FIG. 1) provides certain attitude 
information that is processed by the engineering support 
data (ESD) module 208 to provide indexing information that 
is used to recall reference images (or portions of reference 
images) from the reference database module 204. A portion 
of the reference image that is nearest the video view (i.e., has 
a similar point-of-view of a scene) is recalled from the 
database and is coupled to the geo-spatial aligning module 
202. The module 202 first warps the reference image to form 
a synthetic image having a point-of-view that is similar to 
the current video view, then the module 202 accurately 
aligns the reference information with the video mosaic. The 
alignment process is accomplished in a coarse-to-fine manner
as described in detail below. The transformation param-
eters that align the video and reference images are provided 
to the display module 206. Using these transformation 
parameters, the original video can be accurately overlaid on 
the reference information, or vice versa, to produce a com- 
prehensive view of the scene. 
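
By way of illustration only, the following Python sketch shows one way
such transformation parameters could be applied to overlay reference
annotations on the video. It assumes that the alignment is expressed as a
3x3 projective matrix mapping reference-image coordinates to video
coordinates; the function name apply_projective, the matrix values, and
the sample annotations are hypothetical and are not taken from the
disclosure.

```python
def apply_projective(h, point):
    """Map an (x, y) point through a 3x3 projective matrix h (nested lists)."""
    x, y = point
    xs = h[0][0] * x + h[0][1] * y + h[0][2]
    ys = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this transformation")
    # Divide by the homogeneous coordinate to return to pixel coordinates.
    return (xs / w, ys / w)


# Hypothetical reference-to-video transformation: scale by 2, shift by (10, -5).
H_REF_TO_VIDEO = [
    [2.0, 0.0, 10.0],
    [0.0, 2.0, -5.0],
    [0.0, 0.0, 1.0],
]

# Hypothetical annotations stored with the reference imagery (pixel positions).
annotations = {"bridge": (100.0, 50.0), "runway": (0.0, 0.0)}

# Position of each annotation in the current video frame.
overlay = {
    name: apply_projective(H_REF_TO_VIDEO, xy) for name, xy in annotations.items()
}
```

In this sketch the "bridge" annotation at reference position (100, 50)
lands at video position (210, 95). Applying the inverse matrix instead
would place the video on the reference imagery.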

To obtain input imagery that can be indexed and aligned 
with respect to geographically calibrated reference imagery 
and information, as mentioned above, a "video mosaic" 
representing an imaged scene is produced to remove redun- 
dant information in the video image sequence. Video frames 
are typically acquired at 30 frames per second and contain 
a substantial amount of frame-to-frame overlap. For typical 
altitudes and speeds of airborne platforms, the overlap 
between adjacent frames may range from 4/5 to 49/50 of
a single frame. Therefore, conversion of video frames into 
video mosaics is an efficient way to handle the amount of 
information contained in the incoming video stream. The 
invention exploits the redundancy in video frames by align- 
ing successive video frames using low order parametric 
transformations such as translation, affine and projective
transformations. The frame-to-frame alignment parameters 
enable the creation of a single extended view mosaic image 
that authentically represents all the information contained in 
the aligned input frames. For instance, typically 30 frames of 
standard NTSC resolution (720x480) containing about ten 
million pixels may be reduced to a single mosaic image 
containing only about two-hundred thousand to two million 
pixels depending on the overlap between successive frames. 
The video mosaic is subsequently used for geo-referencing 
and location. 
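
The reduction described above can be checked with simple arithmetic. The
sketch below assumes a deliberately simplified model, not taken from the
disclosure, in which the first frame contributes all of its pixels and
each later frame contributes only its non-overlapping fraction; the
function names are hypothetical.

```python
from fractions import Fraction

FRAME_WIDTH, FRAME_HEIGHT = 720, 480  # standard NTSC resolution
FRAMES = 30                           # one second of video


def raw_pixels(frames=FRAMES):
    """Total pixels in the unmerged frames."""
    return frames * FRAME_WIDTH * FRAME_HEIGHT


def new_pixels(overlap, frames=FRAMES):
    """Pixels added by frames after the first, each adding (1 - overlap) of a frame."""
    frame = FRAME_WIDTH * FRAME_HEIGHT
    return int((frames - 1) * (1 - overlap) * frame)


def mosaic_pixels(overlap, frames=FRAMES):
    """Assumed mosaic size: the whole first frame plus the new pixels."""
    return FRAME_WIDTH * FRAME_HEIGHT + new_pixels(overlap, frames)


low_overlap = Fraction(4, 5)
high_overlap = Fraction(49, 50)
print(raw_pixels())
print(new_pixels(high_overlap), new_pixels(low_overlap))
print(mosaic_pixels(high_overlap), mosaic_pixels(low_overlap))
```

Under this assumed model, 30 frames hold 10,368,000 raw pixels, while the
29 later frames add roughly 200 thousand new pixels at 49/50 overlap and
roughly 2 million at 4/5 overlap, which is consistent in order of
magnitude with the figures given above.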

Although many alignment algorithms are available that 
achieve image alignment of video imagery, the present 
invention uses a projective transformation to align the 
images. Additionally, the mosaicing process is extended to 
handle unknown lens distortion present in the imagery. 
Exemplary alignment processing for images (video images, 
in particular) is disclosed in U.S. Pat. No. 5,649,032. The 
result is a video mosaic representing the information con- 
tained in the sequence of video frames with any redundant 
information removed from the mosaic. 
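
As a non-limiting sketch of how frame-to-frame alignment parameters could
be combined, the Python fragment below chains 3x3 transformations so that
every frame is expressed in the coordinates of the first frame. Using the
first frame as the mosaic coordinate system, the pure-translation
example, and the helper names are assumptions made for illustration only.

```python
def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def translation(dx, dy):
    """3x3 matrix for a pure translation, the simplest parametric transformation."""
    return [[1.0, 0.0, dx], [0.0, 1.0, dy], [0.0, 0.0, 1.0]]


IDENTITY = translation(0.0, 0.0)


def chain_to_mosaic(frame_to_prev):
    """frame_to_prev[k-1] maps frame k into frame k-1 (k = 1..n).

    Returns a list whose entry k maps frame k into frame 0 coordinates.
    """
    to_mosaic = [IDENTITY]
    for step in frame_to_prev:
        # x_0 = T_1 ... T_(k-1) T_k x_k, so right-multiply by the new step.
        to_mosaic.append(matmul3(to_mosaic[-1], step))
    return to_mosaic


# Hypothetical motion: each frame is shifted 14 pixels right of the previous one.
steps = [translation(14.0, 0.0)] * 3
to_mosaic = chain_to_mosaic(steps)
```

Here the fourth frame is placed 42 pixels to the right of the first. The
same chaining applies unchanged to affine or projective matrices.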

Often in aerial video streams, the lens distortion param- 
eters must be explicitly modeled in the estimation process. 
A fundamental assumption made in the earlier work on 
mosaicing was that one image could be chosen as the 
reference image and the mosaic would be constructed by
merging all other images to this reference image. The video
mosaic generation module 200 extends the direct estimation
algorithms of the prior art to use a reference coordinate
system but not a reference image. The module 200 computes
the motion parameters that warp all images to a virtual
image mosaic in this reference coordinate system. Each
pixel in this virtual image mosaic is predicted by intensities
from more than one image. An error measure is minimized
over the virtual image to compensate for lens distortion. The
error measure may be the sum of the variances or the sum
of the predicted pixel intensities at each pixel location. U.S.
patent application Ser. No. 08/966,776, filed Nov. 10, 1997
and incorporated herein by reference, discloses the details of
a lens distortion compensation procedure. Although the lens
distortion compensation procedure does not require a reference
image, the present invention uses a reference image in
the alignment process and this reference imagery can also be
used to facilitate lens distortion compensation.

In order to compute the correspondences of the video
frames and the unknown parameters simultaneously, the
invention uses an error function that minimizes the variance
in intensities of a set of corresponding points in the images
that map to the same ideal reference coordinate. Formally,
the unknown projective transformation parameters for each
frame, $A_1 \ldots A_N$, and the lens distortion parameter, $\gamma_1$, are
solved using Equation 1.

$$\sum_{p} \frac{1}{M(p)} \sum_{i} \left( I_i(p^i) - \bar{I}(p) \right)^2 \qquad (1)$$

where point $p^i$ in frame i is a transformation of a point p in
the reference coordinate system, $\bar{I}(p)$ is the mean intensity
value of all the $p^i$'s that map to p, and M(p) is a count of all
such $p^i$'s. Therefore, given a point p in the reference
coordinates, the summation over i in Equation 1, normalized
by M(p), is the variance of all the intensity values at points
$p^i$ that map to point p.

In geo-spatial registration scenarios, the look angles of the
imaging platform with respect to the Earth may be known
with varying degrees of precision. The knowledge of these
angles and other engineering support data (ESD) can be used
to correct for oblique look angles in order to generate a nadir
view, i.e., use a process for ortho-correction. After performing
ortho-correction, video mosaics may be created as
described above. Ortho-corrected mosaics have the advantage
that the view in an orthographic coordinate system is
similar to that available in orthographic photographs.

Depending on the imaging scenario, ortho-corrected
video mosaicing may have to account for the effects of
parallax. The processing involved has to use the
three-dimensional parallax present in the scene along with the
warping transformation that accounts for the oblique angles
of the camera platform. To account for parallax, the invention
can use one of two approaches: (1) warp the imagery
using any pre-existing Digital Elevation Map (DEM) information
contained in the database or (2) account for parallax
by computing the parallax using multiple images of the
scene. Parallax computation from multiple video images and
its use in the creation of parallax-corrected mosaics is
disclosed in commonly assigned U.S. patent application Ser.
No. 08/493,632, filed Jun. 22, 1995 and incorporated herein
by reference.

In addition to image information, the sensor platform (102
in FIG. 1) also provides engineering support data (ESD),
e.g., global positioning system (GPS) information, Inertial
Navigation System (INS) information, image scale, attitude,
rotation, and the like, that is extracted from the signal
received from the platform and provided to the geo-spatial
aligning module 202 as well as the database module 204.
Specifically, the ESD information is generated by the ESD
generation module 208. The ESD is used as an initial scene
identifier and sensor point-of-view indicator. As such, the
ESD is coupled to the reference database module 204 and
is used to recall database information that is relevant to the
current video imagery. Moreover, the ESD can be
used to maintain coarse alignment between subsequent
video frames over regions of the scene where there is little
or no image texture that can be used to accurately align the
mosaic with the reference image.

More specifically, the ESD that is supplied from the
sensor platform along with the video is generally encoded
and requires decoding to produce useful information for the
geo-spatial aligning module 202 and the reference database
module 204. Using the ESD generation module 208, the
ESD is extracted or otherwise decoded from the signal
produced by the camera platform to define a camera model
(position and attitude) with respect to the reference database.
Of course, this does not mean that the camera platform and
system can not be collocated, i.e., as in a hand held system
with a built in sensor, but means merely that the position and
attitude information of the current view of the camera is
necessary.

Given that ESD, on its own, can not be reliably utilized to
associate objects seen in videos (i.e., sensor imagery) to their
corresponding geo-locations, the present invention utilizes
the precision in localization afforded by the alignment of the
rich visual attributes typically available in video imagery to
achieve exceptional alignment rather than use ESD alone.
For aerial surveillance scenarios, often a reference image
database in geo-coordinates along with the associated DEM
maps and annotations is readily available. Using the camera
model, reference imagery is recalled from the reference
image database. Specifically, given the camera's general
position and attitude, the database interface recalls imagery
(one or more reference images or portions of reference
images) from the reference database that pertains to that
particular view of the scene. Since the reference images
generally are not taken from the exact same perspective as
the current camera perspective, the camera model is used to
apply a perspective transformation (i.e., the reference
images are warped) to create a set of synthetic reference
images from the perspective of the camera.

The reference database module 204 contains a geo-spatial
feature database 210, a reference image and digital elevation
map (DEM) database 212, and a database search engine
214. The geo-spatial feature database 210 generally contains
feature and annotation information regarding various features
of the images within the image database 212. The
reference database 212 contains images (which may include
mosaics) and DEMs of a scene. The two databases are
coupled to one another through the database search engine
214 such that features contained in the images of the
reference database 212 have corresponding annotations in
the feature database 210. Since the relationship between the
annotation/feature information and the reference information
is known, the annotation/feature information can be
aligned with the video images using the same parametric
transformation that is derived to align the reference images
to the video mosaic.

The database search engine 214 uses the ESD to select a
reference image and DEM or a portion of a reference image
and DEM in the reference image and DEM database 204 that
most closely approximates the scene contained in the video.
If multiple reference images and DEMs of that scene are
contained in the reference image and DEM database 212, the
engine 214 will select the reference image and DEM having
a viewpoint that most closely approximates the viewpoint of
the camera producing the current video. The selected reference
image and DEM is coupled to the geo-spatial aligning
module 202.

The geo-spatial aligning module 202 contains a coarse
alignment block 216, a synthetic view generation block 218,
a tracking block 220 and a fine alignment block 222. The
synthetic view generation block 218 uses the ESD and
reference DEM to warp a reference image to approximate
the viewpoint of the camera generating the current video that
forms the video mosaic. These synthetic images form an
initial hypothesis for the geo-location of interest that is
depicted in the current video data. The initial hypothesis is
typically a section of the reference imagery warped and
transformed so that it approximates the visual appearance of
the relevant locale from the viewpoint specified by the ESD.

The alignment process for aligning the synthetic view of
the reference image with the input imagery (e.g., the video
mosaic produced by the video mosaic generation module
200, the video frames themselves that are alternatively
coupled from the input to the geo-spatial aligning module
202 or some other source of input imagery) is accomplished
using two steps. A first step, performed in the coarse
alignment block 216, coarsely indexes the video mosaic and
the synthetic reference image to an accuracy of a few pixels.
In some instances, the first step of coarse alignment is not
used and the invention only performs fine alignment. As
such, the video mosaic generation module 200 is coupled
directly to the fine alignment block 222.

A second step, performed by the fine alignment block 222,
accomplishes fine alignment to accurately register the synthetic
reference image and video mosaic with a sub-pixel
alignment accuracy without performing any camera calibration.
The fine alignment block 222 achieves a sub-pixel
alignment between the images using a Euclidean model. The
Euclidean model is one of a number of models that is
selected by a model selector 252. The model selection
process is described with respect to FIG. 4 below. The output
of the geo-spatial alignment module 202 is a parametric
transformation that defines the relative positions of the
reference information and the video mosaic. This parametric
transformation is then used to align the reference information
with the video such that the annotation/features information
from the feature database 210 is overlaid upon the
video or the video can be overlaid upon the reference images
or both. In essence, accurate localization of the camera
position with respect to the geo-spatial coordinate system is
accomplished using the video content.

Finally, the tracking block 220 updates the current estimate
of sensor attitude and position based upon results of
matching the sensor image to the reference information. As
such, the sensor model is updated to accurately position the
sensor in the coordinate system of the reference information.
This updated information is used to generate new reference
images to support matching based upon new estimates of
sensor position and attitude and the whole process is iterated
to achieve exceptional alignment accuracy. Consequently,
once initial alignment is achieved and tracking commenced,
the geo-spatial alignment module may not be used to compute
the parametric transform for every new frame of video
information. For example, fully computing the parametric
transform may only be required every thirty frames (i.e.,
once per second). Once tracking is achieved, the indexing
block 216 and/or the fine alignment block 222 could be
bypassed for a number of video frames. The alignment
parameters can generally be estimated using frame-to-frame
motion such that the alignment parameters need only be
computed infrequently.

FIG. 3 depicts a functional block diagram of the coarse
alignment block 216 which contains a video mosaic salient
feature extractor 300, a reference image salient feature
extractor 302, an exhaustive search engine 304, and a
directed matching processor 306. The coarse indexing process
locates a video mosaic within a reference image using
visual appearance features. In principle, one could exhaustively
correlate the intensities in the video mosaic and the
reference image at each pixel and find the best match.
However, due to the uncertainties in viewpoint defined by
ESD and due to real changes in appearance between the
reference imagery and the current video, it may not be
possible to directly correlate intensities in the two images.
The real changes in appearance may be due to change of
reflectance of objects and surfaces in the scene (e.g., summer
to fall, shadows and the like) and due to difference in
illumination between the reference and the video imagery.
Changes in appearance due to viewpoint are accounted for
to a large extent by the process of warping the reference
image to the ESD defined viewpoint. However, for robust
matching and localization, indexing and matching must be
resilient to uncertainties in ESD and to real changes in the
imagery.

The coarse alignment block 216 computes features at
multiple scales and multiple orientations that are invariant or
quasi-invariant to changes in viewpoint. To facilitate such
multiple scale computation, the reference images may be
stored as image pyramids or image pyramids may be computed
when the reference image is recalled from the database.
In any event, the reference image scale and resolution
should be comparable to that of the video mosaic. To achieve
flexibility, the salient feature extractors 300 and 302 may
both contain image pyramid generators such that both the
video mosaic and the reference image are decomposed into
image pyramids to facilitate rapid and accurate salient
feature extraction.

Whether operating upon a full video mosaic and reference
image or a level of a pyramid from the two images, the
salient feature extractors 300 and 302 compute many salient
locations both in the reference and video imagery. Such
salient feature detection and processing is disclosed in T.
Lindeberg, "Detecting Salient Blob-like Image Structures
and Their Scales with a Scale-space Primal Sketch: A
Method for Focus-of-attention," International Journal of
Computer Vision, 1994. The salient feature locations are
determined automatically based on distinctiveness of local
image structure, i.e., the salient features may be low frequency
blobs, high frequency corners or edges, or some
combination of these features. The features that are considered
salient depend on the scale and resolution of the
imagery being processed. Even with the feature representations
at salient locations only, there may be too much data
for exhaustive matching of the salient features. Therefore, in
the exhaustive search engine 304, fast indexing of the
multi-dimensional visual features is used to eliminate most
of the false matches, i.e., the salient features are pruned.
Subsequently, the directed matching processor 306 performs
directed matching of the small set of remaining candidate
matches which leads to the correct coarse location of the
video imagery in the reference coordinate system. The
directed matching is performed using a "data tree" process
that is disclosed in U.S. Pat. No. 5,159,647, issued Oct. 27,
1992, and incorporated herein by reference. The output of
the coarse alignment block 216 is a set of coarse parameters
for a parametric transform that aligns the reference image to
the video mosaic.

Returning to FIG. 2, the coarse localization process of
block 216 is used to initialize the process of fine alignment
while accounting for the geometric and photometric transformations
between the video and reference imagery. In
general, the transformation between two views of a scene
can be modeled by (i) an external coordinate transformation
that specifies the 3D alignment parameters between the
reference and the camera coordinate systems, and (ii) an
internal camera coordinate system to image transformation
that typically involves a linear (affine) transformation and
non-linear lens distortion parameters. The fine alignment
block 222 jointly computes the external coordinate transformation
and the linear internal transformation. This, along
with the depth image and the non-linear distortion
parameters, completely specifies the alignment transformation
between the video pixels and those in the reference
imagery. The modeled video-to-reference transformation is
applied to the solution of the precise alignment problem. The
process involves simultaneous estimation of the unknown
transformation parameters as well as the warped reference
imagery that precisely aligns with the video imagery.
Multi-resolution coarse-to-fine estimation and warping with
Gaussian/Laplacian pyramids is employed.

Informally, the transformation from the reference coordinates
to the image coordinates can be understood as follows.
A pixel $x_w$ in the world coordinate system having known height
information is rotated by R about the center of the world
coordinate system and then translated by $(t_x, t_y, t_z)^T$. Finally,
this world point is converted to image coordinates using p
and K.

The seven parameters that are estimated by the algorithm
are the parameters in t, three degrees of freedom in R, and
p. These model parameters are referred to as the B vector.
The K matrix is assumed to be accurately known.

More specifically, the method 500 begins at step 502 and
proceeds to step 504 where a video frame image (I) and a
reference frame image (R) are selected for alignment. At
step 506, the method 500 defines initial model parameters
(B) for each pixel (x). At step 508, the method minimizes an
alignment error as described more fully below. At step 510,
the method 500 queries whether the error minimization is
complete. If the query is negatively answered, the method
recomputes the error. The operation of steps 508 and 510
forms an iterative process that achieves image alignment.
Once aligned, the world points used in the computation are
converted to image coordinates at step 512. Lastly, the
alignment parameters are produced at step 514.

Direct methods for parametric motion estimation minimize
an image error measure over the video frame by
iteratively updating the model parameters. The direct
FIG. 4 depicts a process 400 by which the geo-registration method defines the error E at a particular pixel x on the 
program can select the alignment technique that best suits reference and for particular model parameters B as 
the image content. This process is implemented as a model 30 

selector 252 of FIG. 2. The video mosaic is input at step 402 ^tMWfcw t ) 

and the reference image is input at step 404. The process where I is the video frame, R is the reference, and g is the 

queries, at step 406, whether the DEM data is available and Euclidean model. The error that is minimized is the sum of 

reliable. If the query is affirmatively answered, the process squared differences (SSD). Thus, the error for the particular 

proceeds to step 408 where the Euclidean model (described 35 model parameters B is 

below) is used to accurately align the video mosaic and the (e(Bx)) 2 (a) 

reference image. On the other hand, if the query at step 406 I >- *t C >*)) 

is negatively answered, the process proceeds to step 410. At The Gauss-Newton method minimizes this non-linear 

step 410, the process queries whether the terrain is locally error function by repeating the following least squares 

flat. If the query is affirmatively answered, the process 40 problem. From some current solution B=(t,R,p) the algo- 

proceeds to step 412 where the video mosaic is aligned with rithm moves to B^^t+At^Ar^p+Ap), where R((|>) 

the reference image using the "projective" algorithm. If the denotes rotation of |<j)| radians around the $ axis. E(B MH) ) is 

terrain is not locally flat, the process proceeds to step 414 approximated by linearizing E(B new ,x) with respect to the 

where the "plane+parallax" algorithm is used to align the increment p-(At, Ar,Ap) as 

video mosaic and the reference image. The "projective" and 45 # 

"plane+parallax" algorithms are described in detail in U.S. ^ ^ p p ~ 

patent application Ser. No. 08/493,632, filed Jun. 22, 1995, p is chosen to minimize Ep,^) by solving the linear least 

and incorporated herein by reference. squares problem 

FIG. 5 depicts a method 500 for determining alignment 

parameters using a Euclidean model. The Euclidean model 50 / dijdp 
of step 408 is defined according to the equation 

u^sHK * (f+diag (1 ,1 ,P) *R *xj) (2) (dl m /dp 

Here, K is the matrix modeling the internal parameters of 

the camera and any 2D projective image warps applied to the 55 where I f - means I(g(B,x i )) for the ith of m pixels in the region 

original video frame during pre-processing, except focal of interest. 

length is factored out to be estimated separately, u is the 2D In practice, the Levenberg-Marquardt modification to 

coordinate of a pixel in the preprocessed video frame, ji is Gauss-Newton is used for improved reliability. Furthermore, 

the perspective projection operation which takes a 3-vector the direct method is applied on a Laplacian Pyramid of the 

U to the 2-vector (VJU^VjlJ^). The model uses p=l/f (the 60 image and the reference going from coarser to finer levels, 

inverse of the focal length) to avoid computation difficulties while performing Levenberg-Marquardt iterations at each 

when f— in orthographic viewing. R is the rotation matrix, level of the pyramid. Methods for converting the parameters 

and x w is the 3D coordinates of a point in the world seen in from one level to the next lower level are employed to 

the reference image. The model uses a translation vector generate initial estimates of the parameters for the next 

t-(l x tylJf) T where L,.,^ and are the translations in the 65 lower level. 

camera coordinate system, since the image scale factor t^f Although the Euclidean model algorithm functions using 

remains finite even as t^f-**). the previous set of equations and finds the derivatives of u 



EXB, Xi) \ (5) 
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from the transformation equation for each of the parameters 
to be estimated, the algorithm is improved by normalizing 
the transformation equation to improve the numerical con- 
ditioning of the least squares problem and get greater 
accuracy in the numerical solution of the parameters. 5 
Therefore, the equations are described below for the nor- 
malized parameters rather than for the original equations 
since these are the ones which are actually used. 
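
As an illustration only, the Euclidean model of equation (2) can be sketched for a single world point as follows. The function names and numeric values are hypothetical, and plain Python lists are used in place of any particular matrix library; this is a minimal sketch of the stated formula, not the implementation used by the described system.

```python
def mat_vec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def euclidean_project(K, t, beta, R, x_w):
    """Evaluate u = pi(K (t + diag(1, 1, beta) R x_w)) for one world point."""
    rx = mat_vec(R, x_w)                 # rotate the world point by R
    U = [t[0] + rx[0],                   # diag(1, 1, beta) scales only the
         t[1] + rx[1],                   # third component before the
         t[2] + beta * rx[2]]            # translation t is added
    Uk = mat_vec(K, U)                   # apply the known camera matrix K
    return (Uk[0] / Uk[2], Uk[1] / Uk[2])  # perspective projection pi

# Hypothetical values chosen only for illustration.
K = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u = euclidean_project(K, t=[0.5, -0.25, 1.0], beta=0.0, R=R,
                      x_w=[2.0, 4.0, 10.0])
```

With beta set to zero (the orthographic limit mentioned above), the depth of the world point no longer affects the projected coordinate, which is why the model stays well behaved as the focal length grows.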

The algorithm uses normalized u and x such that $u_{norm}$
and $x_{w,norm}$ have zero mean and unit standard deviation. This
will ensure that the parameters to be estimated are properly
scaled for better numerical stability of the solution.
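
A minimal sketch of such a normalization is given below. It assumes a single scalar standard deviation computed from Euclidean distances to the mean (the square root of the summed per-component variances) and a population (divide-by-n) variance; the function name and the sample points are hypothetical.

```python
import math

def normalize_points(points):
    """Shift points to zero mean and scale them to unit standard deviation.

    The standard deviation is one scalar: the square root of the mean
    squared Euclidean distance of the points from their mean.
    """
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(dim)]
    var = sum(sum((p[k] - mean[k]) ** 2 for k in range(dim))
              for p in points) / n
    sigma = math.sqrt(var)
    normalized = [[(p[k] - mean[k]) / sigma for k in range(dim)]
                  for p in points]
    return normalized, mean, sigma

# Four hypothetical world points on a flat square.
pts = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 2.0, 0.0]]
norm_pts, mean, sigma = normalize_points(pts)
```

The returned mean and sigma are what would be needed to convert parameters estimated in normalized coordinates back to the original coordinates.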

The normalized transformation equation is as follows:

[Equation (6)]

where

[Equations (7) and (8)]

$\sigma_u$ is the standard deviation of $\pi(K^{-1}U)$ calculated over
the video frame coordinates. To define the standard
deviation of a vector u, the model uses

[Equation (9)]

which is the natural definition of standard deviation using
the Euclidean distance to measure distance between two
vectors.

$x_w$ is the point location in the world coordinate system and
$x_{mean}$ is the mean value of the $x_w$ and is calculated for each
frame over the region of interest in the reference. $\sigma_x$ is the
standard deviation of $x_w$ over the same region. $\sigma_x$ is defined
in the same way as before (i.e., square root of the sum of the
squares of the three components).

By equating (2) and (6), parameters can be found as

[Equation (10)]

and

[Equation (11)]

The rotation matrix remains unchanged. These equations
and their converse (i.e., unnormalized parameters in terms of
normalized parameters) are used for transformation from
one set of parameters to the other.

The normalized parameters are solved in much the same
way as for the unnormalized ones as outlined before.
However, there are some differences. The full equations
needed to calculate the derivatives of u with respect to the
modeled parameters are presented below.

Let $p_{norm} = (\Delta t_{norm}, \Delta r, \Delta\beta_{norm})$. The general
equation that is used to calculate the updates to the
parameters is

[Equation (12)]

where the errors with respect to normalized B parameters
can be calculated easily by converting them to the
corresponding unnormalized parameters and taking the
corresponding error. The other matrices are defined in the usual
way as described for the case of unnormalized parameters.

The image derivatives are determined with respect to the
normalized parameters as follows. The derivative
$\partial I/\partial p_{norm}$ for a particular pixel $x_i$ in the
reference image is

$$\frac{\partial I}{\partial p_{norm}} = \frac{\partial I}{\partial u}\,\frac{\partial u}{\partial U}\,\frac{\partial U}{\partial U_{norm}}\,\frac{\partial U_{norm}}{\partial p_{norm}} \qquad (13)$$

where the image gradient

[Equation (14)]

and the gradients

$$\frac{\partial u}{\partial U} = \frac{1}{U_z}\begin{pmatrix} 1 & 0 & -u_x \\ 0 & 1 & -u_y \end{pmatrix} \qquad (15)$$

[Equation (16)]

For each of the $p_{norm}$ parameters the change in $U_{norm}$ is
computed when one of the parameters is perturbed keeping the
others zero. Thus,

$$\frac{\partial U_{norm}}{\partial t_{norm}} = \mathrm{diag}(1, 1, 1) \qquad (17)$$

and

$$\frac{\partial U_{norm}}{\partial \beta_{norm}} = \mathrm{diag}(0, 0, 1)\,R\,x_{w,norm} \qquad (18)$$

It is known that

[Equation]

where notation $[w]_\times$ means the skew-symmetric
cross-matrix of a 3-vector w:

$$[w]_\times = \begin{pmatrix} 0 & -w_z & w_y \\ w_z & 0 & -w_x \\ -w_y & w_x & 0 \end{pmatrix} \qquad (19)$$

Hence

[Equation (20)]

Combining all these derivatives, the model derives the final
3×7 matrix defining $\partial U_{norm}/\partial p_{norm}$.
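
For reference, the skew-symmetric cross-matrix used in the rotation derivative can be sketched as below. The helper names are hypothetical; the only property relied on is the standard identity that multiplying $[w]_\times$ by a vector v gives the cross product of w and v.

```python
def skew(w):
    """Return the skew-symmetric cross-matrix [w]_x of a 3-vector w."""
    wx, wy, wz = w
    return [[0.0, -wz, wy],
            [wz, 0.0, -wx],
            [-wy, wx, 0.0]]

def mat_vec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

# Hypothetical vectors: [w]_x v should equal w x v.
w = [1.0, 2.0, 3.0]
v = [4.0, 5.0, 6.0]
product = mat_vec(skew(w), v)
```

To first order, a small rotation increment applied to a vector v changes it by the cross product of the increment with v, so the derivative of the rotated vector with respect to the increment is the negative cross-matrix of v; this is the usual reason the cross-matrix appears in such Jacobians.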

By combining the results for the different matrices, a row
of the $\partial I/\partial p_{norm}$ matrix for each of the pixels in the region
of interest in the reference image is determined. Thus, the
algorithm generates the m×7 matrix required in equation
(11) which is then used in the Levenberg-Marquardt
algorithm for each level of the Laplacian pyramid to find the
Euclidean parameters in a coarse to fine manner.
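
The update computed at each iteration can be sketched as a damped least squares solve. The sketch below assumes the Levenberg form of the damping (a multiple of the identity added to the normal equations) and uses a small hand-written linear solver to stay dependency-free; `solve`, `lm_step` and the toy Jacobian are hypothetical and stand in for the m×7 matrix and error vector described above.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def lm_step(J, e, lam):
    """One damped update: solve (J^T J + lam * I) p = -J^T e."""
    m, n = len(J), len(J[0])
    JtJ = [[sum(J[i][a] * J[i][c] for i in range(m)) for c in range(n)]
           for a in range(n)]
    Jte = [sum(J[i][a] * e[i] for i in range(m)) for a in range(n)]
    for a in range(n):
        JtJ[a][a] += lam            # damping; lam = 0 gives Gauss-Newton
    return solve(JtJ, [-g for g in Jte])

# Toy problem with two parameters and three "pixels".
J = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
e = [1.0, -2.0, -1.0]
p_gn = lm_step(J, e, lam=0.0)       # undamped Gauss-Newton step
p_lm = lm_step(J, e, lam=1.0)       # damped, shorter step
```

In a full solver the damping value would typically be raised when a step fails to reduce the error and lowered when it succeeds; that control logic is omitted here.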

The Euclidean Algorithm is an iterative update algorithm 
which requires a good initial estimate of the parameters for 
it to work. Therefore, the algorithm is used as part of a 
complete geo-registration system rather than as a
stand-alone algorithm.
In a geo-registration system, generally the task is to
register every frame of the whole video sequence to the
reference image data set (orthophoto and DEM). Additional
inputs may include ESD for some frames or, equivalently,
point correspondences picked by a human operator. In
order to accomplish this task efficiently, accurately, and
reliably, the technique exploits, in addition to coarse-to-fine
processing in scale, progressive model complexity and
temporal continuity.

In progressive model complexity, e.g., translation to affine 
to projective 2D to Euclidean, the registration result of each 
stage becomes the initial seed for the next stage. This is 
useful because searching over a large range of uncertainty is 
easier with a low order model, while increasing the model 
order too quickly may fail when the practical convergence 
range of the Euclidean algorithm is exceeded by the regis- 
tration error of low order estimates. 

Indeed, the initial pose uncertainty can be large when the 
ESD for a frame is very rough or missing and no temporal 
prediction is available. Coarse alignment block 216 aligns 
the video and reference image to within a few pixels. 

Initializing a given stage's parameters from a prior stage
model is generally straightforward: simply assume zero for the
parameters not previously estimated. However, the
projective 2D model is converted to the Euclidean model (t, R, β)
that best matches the projective 2D mapping from reference
to video for points that lie on the local planar approximation
of the DEM.
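
The seeding of one stage from the previous one can be sketched for the 2D stages as follows. The sketch assumes each model is written as a matrix whose not-yet-estimated terms take their neutral values (identity linear part, zero projective terms); the function names are hypothetical, and the final projective-to-Euclidean conversion, which requires the planar fit described above, is not shown.

```python
def translation_to_affine(tx, ty):
    """Seed a 2D affine model from a translation (identity linear part)."""
    return [[1.0, 0.0, tx],
            [0.0, 1.0, ty]]

def affine_to_projective(A):
    """Seed a 2D projective model from an affine one (zero projective terms)."""
    return [A[0][:], A[1][:], [0.0, 0.0, 1.0]]

def apply_projective(H, x, y):
    """Map the point (x, y) through the 3x3 projective matrix H."""
    X = H[0][0] * x + H[0][1] * y + H[0][2]
    Y = H[1][0] * x + H[1][1] * y + H[1][2]
    W = H[2][0] * x + H[2][1] * y + H[2][2]
    return (X / W, Y / W)

# A translation estimated at the first stage seeds the later stages.
H_seed = affine_to_projective(translation_to_affine(3.0, -2.0))
mapped = apply_projective(H_seed, 1.0, 1.0)
```

Because the seed reproduces the lower order result exactly, each refinement stage starts from an alignment that is already as good as the previous stage could achieve.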

Temporal continuity may be exploited to avoid the
expense of full progressive complexity on every frame to be
geo-registered, or when a frame has no ESD of its own. One
approach is to initialize a model not from a next lower order
model estimated in the current frame but based on the same
model in the previous frame, cascaded with an interframe
transformation. This process is a form of tracking that is
performed by the tracking block 220 of FIG. 2 where a
previously computed parametric transformation associated
with a prior frame is used to initialize (or seed) a parametric
transformation of the present frame. For example, given
Euclidean parameters (t, R, β) for the previous frame, those
parameters could be converted into projective parameters
that match for points on the local planar approximation of
the DEM, cascaded with the interframe projective
transformation, then converted to the Euclidean parameters
by the method given above. Another approach is to build a
local mosaic out of a sequence of adjacent frames, again
using interframe information, and to geo-register the
mosaic instead of individual frames. This has the added
benefit of collecting a larger spatial context, since a single
frame might not have enough features that match well to the
reference image. For both approaches, adjacent video frames
need to be aligned with respect to each other using some
parametric model. This can be done reliably even if features
are too sparse for geo-registration because the interframe
change of appearance and geometry is small.
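
The cascading step of the first approach can be sketched as a product of 3x3 projective matrices. The mapping directions (reference to previous frame, previous frame to current frame) and the names used are assumptions made for illustration.

```python
def mat_mul(A, B):
    """Multiply two 3x3 matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def seed_from_previous(H_ref_to_prev, H_prev_to_cur):
    """Cascade the previous frame's registration with the interframe motion.

    A reference point is first mapped into the previous frame and then
    into the current frame, so the interframe matrix is applied last.
    """
    return mat_mul(H_prev_to_cur, H_ref_to_prev)

# Hypothetical pure translations: 5 pixels in x, then (1, 2) between frames.
H_ref_to_prev = [[1.0, 0.0, 5.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H_prev_to_cur = [[1.0, 0.0, 1.0], [0.0, 1.0, 2.0], [0.0, 0.0, 1.0]]
H_seed = seed_from_previous(H_ref_to_prev, H_prev_to_cur)
```

The cascaded matrix is only a seed; it would still be converted to Euclidean parameters and refined against the reference imagery for the present frame.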

The overall sequencing of operations constitutes the con- 
trol strategy of the complete georegistration system, and 
many variations can be devised by those skilled in the art. 

Once the alignment parameters have been computed, the
display generation module 206 can warp the reference image 
to the video image or vice versa, accurately overlay certain 
reference information onto the video image, and the like. In 
one embodiment of the invention the video images can be 
warped to the reference image. These video images can then 
be merged to construct geo-mosaics (geo-referenced video 
mosaics). These mosaics can be used to update the reference 
imagery. The video mosaics that are warped in this manner 
are coupled to the reference database module 204 along path 
250 in FIG. 2. 

For annotation and other visualization tasks, it is
important for the user to be able to map points from the video to
the reference image and vice versa. Similarly for warping
the video image to the reference image, the invention can 
use reverse warping with bilinear interpolation. However, to 
warp the reference image to appear in the video image 
coordinates, the invention must use forward warping. Point
mappings in the forward warping process are computed 
using the above technique. 
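
Reverse warping with bilinear interpolation can be sketched as below. The image is a plain list of rows, `dst_to_src` is a hypothetical callable standing in for whatever point mapping the alignment parameters provide, and pixels that map outside the source are left as `None`.

```python
import math

def bilinear_sample(img, x, y):
    """Sample a list-of-rows image at real-valued (x, y); None if outside."""
    h, w = len(img), len(img[0])
    if x < 0 or y < 0 or x > w - 1 or y > h - 1:
        return None
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
    bot = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
    return (1 - fy) * top + fy * bot

def reverse_warp(src, out_w, out_h, dst_to_src):
    """Fill each output pixel by mapping it back into src and interpolating."""
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            sx, sy = dst_to_src(x, y)
            row.append(bilinear_sample(src, sx, sy))
        out.append(row)
    return out

# Tiny 2x2 source image and a hypothetical output-to-source mapping.
src = [[0.0, 10.0], [20.0, 30.0]]
out = reverse_warp(src, 2, 1, lambda x, y: (x * 0.5, 0.5))
```

Reverse warping visits every output pixel exactly once, so it leaves no holes; forward warping instead pushes source pixels to computed destinations and generally needs extra handling for gaps and overlaps.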

To accomplish the overlay process, the display module 
contains a video mixer that combines the video and database 
information on the operator's reference overview image 
monitor window. Additionally, the overlay may include
attached sound/text references for point and click recall of 
additional information pertaining to the imaged scene. As 
such, the video images can be mapped to the geo-spatial
reference information or the geo-spatial reference informa- 
tion can be mapped to the video. This flexible integration of 
currently generated video and database information provides
a rich source of accessible information for many applica- 
tions. 

The annotation information, the video and the transfor- 
mation parameters are provided to the display module 206. 
The display module produces various user defined display
formats. For example, the video or video mosaic can be 
overlaid atop the reference images or vice versa, annotation 
data can be overlaid atop the video, reference images and 
annotation information can both be overlaid atop the video 
images and so on. The database may further contain DEM
maps and multi-modal annotations such as graphics, text, 
audio, maps and the like. Additionally, the video view can be 
used as an initial view of a scene, then the reference database 
imagery could be displayed as a synthetic view of the scene 
extrapolated from the original view as a virtual camera 
moves from the initial view through the scene, i.e., a
synthetic "fly through." 

Furthermore, objects in the video can be identified by the 
user and marked with a cursor. The system can accurately 
geo-locate the selected point with respect to the reference 
coordinate system.

Although various embodiments which incorporate the 
teachings of the present invention have been shown and 
described in detail herein, those skilled in the art can readily 
devise many other varied embodiments that still incorporate 
these teachings.

What is claimed is: 

1. A system for performing geo-spatial registration of an 
input image and geographically calibrated reference imag- 
ery and a digital elevation map (DEM) comprising: 

a reference database module containing geographically
calibrated reference imagery and DEM, for producing 
geographically calibrated reference imagery and DEM 
relating to imagery in said input image; and 
an alignment module, coupled to said reference database
module, for aligning said input image to said
geographically calibrated reference imagery and DEM
using a Euclidean model. 

2. The system of claim 1 further comprising: 

a mosaic generation module for producing an image 
mosaic as said input image, where said image mosaic
is generated from a sequence of sensor images. 

3. The system of claim 1 further comprising: 

a source of sensor attitude for generating attitude infor- 
mation pertaining to a sensor producing said input 
image.

4. The system of claim 3 wherein said alignment module 
further comprises: 

a coarse alignment block, coupled to said reference data- 
base module and said sensor attitude source, for
aligning said geographically calibrated reference imagery
and DEM to alignment with said input image using said
sensor attitude.
5. The system of claim 1 wherein said alignment module
further comprises: 

a coarse alignment block, coupled to said reference data- 
base module, for aligning said geographically
calibrated reference imagery and DEM.

6. The system of claim 4 wherein said alignment module 
further comprises: 

a fine alignment block, coupled to said coarse alignment 
block, for accurately aligning said input image to said 
geographically calibrated reference imagery and DEM
to a sub-pixel accuracy. 

7. The system of claim 3 wherein said alignment module 
further comprises a synthetic view generation block for 
warping said geographically calibrated reference imagery 
and DEM to have a viewpoint similar to a viewpoint of said
sensor. 

8. The system of claim 3 wherein said sensor is a video 
camera. 

9. The system of claim 5 further comprising a tracking 
block, coupled to said fine alignment block, for using a
previously computed parametric transformation associated 
with a prior frame as an initialization transformation for 
computing a parametric transformation for a present frame. 

10. The system of claim 5 further comprising a tracking 
block, coupled to said fine alignment block, for tracking said 
parametric transformation such that a new parametric trans- 
formation does not have to be computed for each new input 
image. 

11. The system of claim 1 further comprising a display 
module for generating a display that uses the parametric 
transformation to align said geographically calibrated refer- 
ence imagery with said input image and simultaneously 
display said geographically calibrated reference imagery 
and said input image.

12. The system of claim 4 wherein said coarse alignment 
block further comprises: 

an input image salient feature extractor;
a reference image salient feature extractor; 
an exhaustive search engine; and 
a directed matching processor. 

13. The system of claim 5 wherein said fine alignment 
block aligns imagery using the Euclidean model: 

$u = \pi(K(t + \mathrm{diag}(1, 1, \beta)\,R\,x_w))$ where

t is a translation vector, $\beta$ is an inverse focal length, R is
a rotational matrix, $x_w$ are 3D coordinates, and K are
camera parameters.

14. A method for performing geo-spatial registration of an
input image and geographically calibrated reference imag- 
ery and digital elevation map (DEM) comprising: 

producing geographically calibrated reference imagery 
and DEM relating to imagery in said input image; and 

aligning said input image to said geographically cali- 
brated reference imagery and DEM using a Euclidean 
model.

15. The method of claim 14 further comprising the step of: 
producing an image mosaic as said input image, where 

said image mosaic is generated from a sequence of 
sensor images. 

16. The method of claim 14 further comprising the step of:
generating attitude information pertaining to a sensor 

producing said input image. 

17. The method of claim 16 wherein said aligning step 
further comprises the step of: 

coarsely aligning said geographically calibrated reference
imagery and DEM to alignment with said input image
using said sensor attitude.
18. The method of claim 17 wherein said aligning step
further comprises the step of: 

accurately aligning said input image to said geographi- 
cally calibrated reference imagery and DEM to a sub- 
pixel accuracy. 

19. The method of claim 17 wherein said aligning step 
further comprises warping said geographically calibrated 
reference imagery and DEM to have a viewpoint similar to 
a viewpoint of said sensor. 

20. The method of claim 18 further comprising the step of: 
tracking said parametric transformation such that a new 

parametric transformation does not have to be com- 
puted for each new input image. 

21. The method of claim 14 further comprising the step of: 
generating a display that uses the parametric transforma- 
tion to align said geographically calibrated reference 
imagery with said input image and simultaneously 
display said geographically calibrated reference imag- 
ery and said input image. 

22. The method of claim 18 wherein said accurate align- 
ing step further comprises the step of: 

selecting an alignment process best suited for the input 
image. 

23. The method of claim 14 further comprising the step of: 
determining geographic coordinates of a user selected 

point within said input image. 

24. The method of claim 14 further comprising the step of: 
compensating for lens distortion associated with the input 

image. 

25. The method of claim 14 further comprising the step of: 
updating the geographically calibrated reference imagery 

in the reference database with information from the 
input image. 

26. The method of claim 14 further comprising the step of: 
generating a synthetic fly through starting at a current 

viewpoint of the input image and continuing using the 
geographically calibrated reference imagery. 

27. The method of claim 17 wherein said coarsely align- 
ing step further comprises: 

extracting salient features from the input image; 
extracting salient features from a reference image;
exhaustively searching the salient features of the input 

image and the reference image; and 
identifying matching salient features in the input image 

and the reference image. 

28. The method of claim 18 wherein said accurately 
aligning aligns imagery using the Euclidean model: 

$u = \pi(K(t + \mathrm{diag}(1, 1, \beta)\,R\,x_w))$ where

t is a translation vector, $\beta$ is an inverse focal length, R is
a rotational matrix, $x_w$ are 3D coordinates, and K are
camera parameters.

29. A digital storage medium containing a computer 
program that, when executed by a general purpose computer, 
forms a specific purpose computer that performs the steps 
of: 

producing geographically calibrated reference imagery 
and digital elevation map (DEM) relating to imagery in 
said input image; and 

aligning said input image to said geographically cali- 
brated reference imagery and DEM using a Euclidean 
model. 

30. The medium of claim 29 further performing the step 
of: 

producing an image mosaic as said input image, where 
said image mosaic is generated from a sequence of 
sensor images.